Inference of Multiscale Gaussian Graphical Model

Authors
Affiliations
Published

August 28, 2022

Abstract

Gaussian Graphical Models (GGMs) are widely used for exploratory data analysis in various fields such as genomics, ecology, psychometry. In a high-dimensional setting, when the number of variables exceeds the number of observations by several orders of magnitude, the estimation of GGM is a difficult and unstable optimization problem. Clustering of variables or variable selection is often performed prior to GGM estimation. We propose a new method allowing to simultaneously infer a hierarchical clustering structure and the graphs describing the structure of independence at each level of the hierarchy. This method is based on solving a convex optimization problem combining a graphical lasso penalty with a fused type lasso penalty. Results on real and synthetic data are presented.

build status Creative Commons License

1 Introduction

Probabilistic graphical models (Lauritzen 1996; Koller and Friedman 2009) are widely used in high-dimensional data analysis to synthesize the interaction between variables. In many applications, such as genomics or image analysis, graphical models reduce the number of parameters by selecting the most relevant interactions between variables. Undirected Gaussian Graphical Models (GGMs) are a class of graphical models used in Gaussian settings. In the context of high-dimensional statistics, graphical models are generally assumed sparse, meaning that a small number of variables interact, compared to the total number of possible interactions. This assumption has been shown to provide both statistical and computational advantages by simplifying the structure of dependence between variables (Dempster 1972) and allowing efficient algorithms (Meinshausen and Bühlmann 2006). See, for instance, (Fan, Liao, and Liu 2016) for a review about sparse graphical models inference.

In GGMs, it is well known (see, e.g., Lauritzen (1996)) that inferring the graphical model or equivalently the Conditional Independence Graph (CIG) boils down to inferring the support of the precision matrix \mathbf{\Omega} (the inverse of the variance-covariance matrix). To do so, several \ell_1 penalized methods have been proposed in the literature to learn the CIG of GGMs. For instance, the neighborhood selection (Meinshausen and Bühlmann 2006) (MB) based on a nodewise regression approach via the least absolute shrinkage and selection operator (Lasso, Tibshirani (1996)) is a popular method. Each variable is regressed on the others, taking advantage of the link between the so-obtained regression coefficients and partial correlations. More precisely, for all 1 \le i \le p, the following problem is solved:

\hat{\boldsymbol{\beta}^i}(\lambda) = \underset{\boldsymbol{\beta}^i \in \mathbb{R}^{p-1}}{\operatorname{argmin}} \frac{1}{n} \left \lVert \mathbf{X}^i - \mathbf{X}^{\setminus i} \boldsymbol{\beta}^i \right \rVert_2 ^2 + \lambda \left \lVert \boldsymbol{\beta}^i \right \rVert_1. \tag{1}

In Equation 1, \lambda is a non negative regularization parameter and \mathbf{X}^{\setminus i} denotes the matrix \mathbf{X} deprived of column i. The MB method defined by the estimation problem in Equation 1 has generated a long line of work in the field of nodewise regression methods. For instance, Rocha, Zhao, and Yu (2008), Ambroise, Chiquet, and Matias (2009) showed that nodewise regression could be seen as a pseudo-likelihood approximation, and Peng et al. (2009) extended the MB method to estimate sparse partial correlations using a single regression problem. Other inference methods similar to nodewise regression include a method based on Dantzig selector (Yuan 2010) and the introduction of the Clime estimator (Cai, Liu, and Luo 2011).

Another family of sparse CIG inference methods directly estimates \mathbf{\Omega} via direct minimization of the \ell_1-penalized negative log-likelihood (Banerjee, El Ghaoui, and d’Aspremont 2008), without resorting to the auxiliary regression problem. This method, called the graphical lasso (Friedman, Hastie, and Tibshirani 2007), benefits from many optimization algorithms (Yuan and Lin 2007; Rothman et al. 2008; Banerjee, El Ghaoui, and d’Aspremont 2008; Hsieh et al. 2014).

Such inference methods are widely used and enjoy many favorable theoretical and empirical properties, including robustness to high-dimensional problems. However, some limitations have been observed, particularly in the presence of strongly correlated variables. These limitations are caused by known impairments of Lasso-type regularization in this context (Bühlmann et al. 2012; Park, Hastie, and Tibshirani 2006). To overcome this, in addition to sparsity, several previous works attempt to estimate CIG by integrating clustering structures among variables for the sake of both statistical sanity and interpretability. A non-exhaustive list of works that integrate a clustering structure to speed up or improve the estimation procedure includes (Honorio et al. 2009; Ambroise, Chiquet, and Matias 2009; Mazumder and Hastie 2012; Tan, Witten, and Shojaie 2013; Yao and Allen 2019; Devijver and Gallopin 2018).

The above methods exploit the group structure to simplify the graph inference problem and infer the CIG between single variables. Another question that has received less attention is the inference of the CIG between the groups of variables, i.e., between the meta-variables representative of the group structure. A recent work introducing inference of graphical models on multiple grouping levels is (Cheng, Shan, and Kim 2017). They proposed inferring the CIG of gene data on two levels corresponding to genes and pathways, respectively. Note that pathways are groups of functionally related genes known in advance. The inference is achieved by optimizing a penalized maximum likelihood that estimates a sparse network at both gene and group levels. Our work is also part of this dynamic. We introduce a penalty term allowing parsimonious networks to be built at different hierarchical clustering levels. The main difference with the procedure of (Cheng, Shan, and Kim 2017) is that we do not require prior knowledge of the group structure, which makes the problem significantly more complex. In addition, our method has the advantage of proposing CIGs at more than two levels of granularity.

We introduce the Multiscale Graphical Lasso (MGLasso), a novel method to estimate simultaneously a hierarchical clustering structure, and graphical models depicting the conditional independence structure between clusters of variables at each level of the hierarchy. The procedure is based on a convex optimization problem with a hybrid penalty term combining a graphical Lasso and a group-fused Lasso penalty. In the spirit of convex hierarchical clustering, introduced by (Hocking et al. 2011; Lindsten, Ohlsson, and Ljung 2011), the hierarchy is obtained by spanning the entire regularization path. At each level of the hierarchy, variables in the same clusters are represented by a meta-variable; the new CIG is estimated between these new meta-variables, leading to a multiscale graphical model. Unlike (Yao and Allen 2019), who introduced convex clustering in GGMs, our approach is expected to produce sparse estimations, thanks to an additional \ell_1 penalty.

The remainder of this paper is organized as follows. In section Multiscale Graphical Lasso, we formally introduce the Multiscale Graphical Lasso based on a convex estimation problem and an optimization algorithm based on the continuation of Nesterov’s smoothing technique (Hadj-Selem et al. 2018). Section Simulation experiments presents numerical results on simulated and real data.

2 Multiscale Graphical Lasso

The proposed method aims at inferring a graphical Gaussian model while hierarchically grouping variables. It infers conditional independence between different groups of variables. The approach is based on neighborhood selection (Meinshausen and Bühlmann 2006) and considers an additional fused-Lasso type penalty for clustering. In the spirit of hierarchical convex clustering, the hierarchical structure is recovered by spanning the regularization path.

Let X = (X^1, \dots, X^p)^T \in \mathbb R^p be a p-dimensional Gaussian random vector, with mean vector \mu and covariance matrix \mathbf \Sigma. The conditional independence structure of X is characterized by a graph G = (V, E), where V = \{1,\ldots p\} is the set of variables and E the set of edges, uniquely determined by the support of the precision matrix \mathbf{\Omega} = \mathbf{S}^{-1} (see, e.g., Dempster (1972)). In other words, for any two vertices i,j\in V, the edge (i,j) belongs to the set E if and only if \Omega_{ij} \neq 0, that is if and only if the i-th and j-th variables are conditionally independent given all the others i.e. X^i \perp \!\!\! \perp X^j |X^{\setminus (i, j)} where X^{\setminus (i, j)} is the set of all p variables deprived of variables i and j.

Considering the linear model X^i = \sum_{j\neq i} \beta^i_j X_j + \epsilon_i where \epsilon_i is a Gaussian centered random variable, we have \beta^i_j = -\frac{\Omega_{ij}}{\Omega_{ii}}. We define the regression matrix \boldsymbol{\beta} := [{\boldsymbol{\beta}^1}^T, \ldots, {\boldsymbol{\beta}^p}^T]^T \in \mathbb{R}^{p \times (p-1)}, whose rows are the regression vectors for each of the p regressions.

Let the n \times p-dimensional matrix \mathbf{X} contain n independent observations of X. We propose to minimize the following criterion which combines Lasso and group-fused Lasso penalties:

J_{\lambda_1, \lambda_2}(\boldsymbol{\beta}; \mathbf{X} ) = \frac{1}{2} \sum_{i=1}^p \left \lVert \mathbf{X}^i - \mathbf{X}^{\setminus i} \boldsymbol{\beta}^i \right \rVert_2 ^2 + \lambda_1 \sum_{i = 1}^p \left \lVert \boldsymbol{\beta}^i \right \rVert_1 + \lambda_2 \sum_{i < j} \left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2, \tag{2}

where \tau_{ij} is a permutation exchanging the coefficients \boldsymbol{\beta}^j_j and \boldsymbol{\beta}^j_i and leaves other coefficients untouched, \mathbf{X}^{i}\in \mathbb{R}^n denotes the i-th column of \mathbf{X}, \boldsymbol{\beta}_{i} denotes the i-th row of \beta, \lambda_1 and \lambda_2 are penalization parameters. Let us consider \hat{\boldsymbol{\beta}} \in \underset{\boldsymbol{\beta}}{\operatorname{argmin}} J_{\lambda_1, \lambda_2}(\boldsymbol{\beta}, \mathbf{X}).

The lasso penalty term encourages sparsity and the penalty term \left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2 encourages to fuse regression vectors \boldsymbol{\beta}^i and \boldsymbol{\beta}^j. These fusions uncover a clustering structure. The model is likely to cluster together variables that have the same conditional effects on the others. Variables X^i and X^j are assigned to the same cluster when \boldsymbol{\beta}^i = \tau_{ij}(\boldsymbol{\beta}^j).

Let us illustrate by an example the effect of the proposed approach. If we consider a group of q variables whose regression vectors have at least q non-zero coefficients and further assume that for each pair of group variables i and j, \|\boldsymbol{\beta}^i - \tau_{ij} (\boldsymbol{\beta}^j)\|_2=0. After some permutations, we get a q\times q block of non-zeros coefficient \beta_{ij} corresponding to the group in the \boldsymbol{\beta} matrix, where (i,j)\in \{1,\cdots, q\}^2. If we consider three different indices i,j,k \ \in \{1,\cdots, q\}^3, it is straightforward to show that the six coefficients indexed by (i,j,k) are all equal. Thus the distance constraints between vectors \boldsymbol{\beta}^i of a group forces equality of all regression coefficients in the group.

The greater the regularization weight \lambda_2, the larger groups become. This is the core principle of the convex relaxation of hierarchical clustering introduced by Hocking et al. (2011). Hence, we can derive a hierarchical clustering structure by spanning the regularization path obtained by varying \lambda_2 while \lambda_1 is fixed. The addition of a fused-type term in graphical models inference has been studied previously by authors such as Honorio et al. (2009), Ganguly and Polonik (2014), Grechkin et al. (2015). However, these existing methods require prior knowledge of the neighborhood of each variable. On the contrary, our approach allows simultaneous inference of a multi-level graphical model and a hierarchical clustering of the variables.

In practice, if some information about the clustering structure is available, the problem can be generalized into:

\min_{\boldsymbol{\beta}} \sum_{i=1}^p\frac{1}{2} \left \lVert \mathbf{X}^i - \mathbf{X}^{\setminus i} \boldsymbol{\beta}^i \right \rVert_2 ^2 + \lambda_1 \sum_{i = 1}^p \left \lVert \boldsymbol{\beta}^i \right \rVert_1 + \lambda_2 \sum_{i < j} w_{ij} \left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2, \tag{3}

where w_{ij} is a positive weight encoding prior knowledge of the groups to which variables i and j belong to. In the remainder of the paper, we will consider w_{ij} = 1.

3 Numerical scheme

This section introduces a complete numerical scheme to apply MGLasso in practice, using a convex optimization algorithm and a model selection procedure. Section Optimization via CONESTA algorithm reviews the principles of the Continuation with Nesterov smoothing in a shrinkage-thresholding algorithm (CONESTA, Hadj-Selem et al. (2018)), the optimization algorithm used in practice to solve MGLasso. Section Reformulation of MGLasso for CONESTA algorithm details a reformulation of MGLasso, which enables us to apply CONESTA. Finally, Section Model selection presents the procedure used to select the regularization parameters.

3.1 Optimization via CONESTA algorithm

The optimization problem for Multiscale Graphical Lasso is convex but not straightforward to solve using classical algorithms because of the fused-lasso type penalty, which is non-separable and admits no closed-form solution for the proximal gradient. We rely on the Continuation with Nesterov smoothing in a shrinkage-thresholding algorithm (CONESTA, Hadj-Selem et al. (2018)), dedicated to high-dimensional regression problems with structured sparsity such as group structures.

The CONESTA solver, initially introduced for neuro-imaging problems, addresses a general class of convex optimization problems which includes group-wise penalties, admitting loss functions of the form:

f(\boldsymbol{\theta} ) = g(\boldsymbol{\theta}) + \lambda_1 h(\boldsymbol{\theta}) + \lambda_2 s(\boldsymbol{\theta}), \tag{4}

where \lambda_1 and \lambda_2 are penalty weights, and \boldsymbol{\theta}\in \mathbb{R}^d is a d-dimensional vector of parameters to optimize. In the original paper (Hadj-Selem et al. 2018), the function g(\boldsymbol{\theta}) is the sum of a least squares criterion and a ridge penalty, h(\boldsymbol{\theta}) is a penalty whose proximal operator is known in closed-form, and s(\boldsymbol{\theta}) is an \ell_{1,2} penalty of the form

s(\boldsymbol{\theta}) = \sum_{\phi \in \Phi} \|\mathbf{D}_\phi \boldsymbol{\theta}_\phi\|_2. In the definition of s(\boldsymbol{\theta}), \Phi = \{ \phi_1, \dots, \phi_{\operatorname{Card}(\Phi)}\} is a set of subsets of indices, i.e., \Phi_i\subset \{1,\ldots, d\} for all i\in\{1,\ldots,\operatorname{Card}(\Phi)\} and, for all \phi\in \Phi, \boldsymbol{\theta}_\phi is the sub-vector of \boldsymbol{\theta} defined by \boldsymbol{\theta}_\phi = (\theta_i)_{i\in\phi}. Finally, \mathbf{D}_\phi are linear operators. The main ingredient of CONESTA is the approximation of the non-smooth \ell_{2,1}-norm penalty with unknown proximal gradient, by a smooth function with known proximal gradient computed using Nesterov’s smoothing. Given a smoothing parameter \mu>0, let us define the smooth approximation

s_{\mu}(\boldsymbol{\theta}) = \max_{\boldsymbol{\alpha} \in \mathcal{K}} \left \{ \boldsymbol{\alpha}^T \mathbf{D} \boldsymbol{\theta} - \frac{\mu}{2} \| \boldsymbol{\alpha} \|_2^2 \right \}, where \mathcal{K} is the \ell_2 unit ball. Note that \lim_{\mu \rightarrow 0} s_{\mu}(\boldsymbol{\theta}) = s(\boldsymbol{\theta}). An accelerated proximal gradient algorithm (FISTA, Beck and Teboulle (2009)) step can then be applied after computing the gradient of the smooth part of the approximated criterion which is given by g(\boldsymbol{\theta}) + s_{\mu}(\boldsymbol{\theta}). At each iteration, the smoothing parameter \mu is updated dynamically using the duality gap, and a new approximation is computed. The CONESTA algorithm enjoys a linear convergence rate, and was shown empirically to outperform other computational options for structured-sparsity problems such as ADMM and inexact FISTA in terms of convergence speed (Hadj-Selem et al. 2018).

3.2 Reformulation of MGLasso for CONESTA algorithm

To apply CONESTA, it is necessary to reformulate the MGLasso problem in order to comply with the form of loss function required by CONESTA. The objective of MGLasso can indeed be written as

\operatorname{argmin} \frac{1}{2} ||\mathbf{Y} - \tilde{\mathbf{X}} \tilde{\boldsymbol{\beta}}||_2^2 + \lambda_1 ||\tilde{\boldsymbol{\beta}}||_1 + \lambda_2 \sum_{i<j} ||\boldsymbol A_{ij} \tilde{\boldsymbol{\beta}}||_2, \tag{5}

where \mathbf{Y} = \operatorname{Vec}(\mathbf{X}) \in \mathbb{R}^{np}, \tilde{\boldsymbol{\beta}} = \operatorname{Vec(\boldsymbol{\beta})} \in \mathbb{R}^{p (p-1)}, \tilde{\mathbf{X}} is a \mathbb{R}^{[np]\times [p \times (p-1)]} block-diagonal matrix with \mathbf{X}^{\setminus i} on the i-th block. The matrix \boldsymbol A_{ij} is a p\times p(p-1) matrix defined by

\boldsymbol A_{ij} (k, l) = \begin{cases} 1, \ \text{if} \ l =(i-1)p+k, \\ -1, \ \text{if} \ l = (j-1)p+k,\\ 0, \ \text{otherwise.} \end{cases}

Note that we introduce this notation for simplicity of exposition, but, in practice, the sparsity of the matrices A_{ij} allows a more efficient implementation. Based on reformulation Equation 5, we may apply CONESTA to solve the objective of MGLasso for fixed \lambda_1 and \lambda_2. The procedure is applied, for fixed \lambda_1, to a range of decreasing values of \lambda_2 to obtain a hierarchical clustering. The corresponding pseudo-code is given in the following algorithm where (\mathbf{X}^i)^{+} denotes the pseudo-inverse of \mathbf{X}^i and \epsilon_{fuse} the threshold for merging clusters.

\begin{algorithm}
\caption{MGLasso algorithm}
\begin{algorithmic}
\STATE \textbf{Input}: data $\mathbf{X} \in \mathbb R^{n\times p}$, $\lambda_1$, starting value $\lambda_2$, step $\kappa > 1$ \\ $\operatorname{clusters} \leftarrow \left \{\{1\}, \dots, \{p\}\right \}$ \\ $\boldsymbol{\beta}^i \leftarrow (\mathbf{X}^i)^{+}\mathbf{X}^i$

\WHILE{$\operatorname{Card}(\operatorname{clusters}) \ge 2$}
  \STATE $\boldsymbol{\beta} \leftarrow CONESTA(\mathbf{X}, \boldsymbol A, \lambda_1, \lambda_2)$
  \FOR{$i = 1$ \TO $p$}
    \FOR{$j = i$ \TO $p$}
      \IF{$\operatorname{dist}(\boldsymbol{\beta}^i, \boldsymbol{\beta}^j) < \epsilon_{fuse}$}
        \STATE $\operatorname{clusters}_i = \operatorname{clusters}_j$
      \ENDIF
    \ENDFOR
  \ENDFOR
  \STATE \textbf{Update}: $\operatorname{clusters}$
  \STATE $\lambda_2 \leftarrow \lambda_2 \times \kappa$
\ENDWHILE
\STATE \textbf{Return} $\boldsymbol{\beta}$ for each couple $(\lambda_1, \lambda_2$)
\end{algorithmic}
\end{algorithm}

3.3 Model selection

A crucial question for practical applications is the definition of a rule to select the penalty parameters (\lambda_1, \lambda_2). This selection problem operates at two levels: \lambda_1 controls the sparsity of the graphical model, and \lambda_2 controls the number of clusters in the optimal clustering partition. These two parameters are dealt with separately: the sparsity parameter \lambda_1 is chosen via model selection, while the clustering parameter \lambda_2 varies across a grid of values, in order to obtain graphs with different levels of granularity. The problem of model selection in graphical models is difficult in the high dimensional case where the number of samples is small compared to the number of variables, as classical AIC and BIC criteria tend to perform poorly (Liu, Roeder, and Wasserman 2010). Alternative criteria have been proposed in the literature, such as cross-validation (Bien and Tibshirani 2011), GGMSelect (Giraud, Huet, and Verzelen 2012), stability selection (Meinshausen and Bühlmann 2010; Liu, Roeder, and Wasserman 2010), Extended Bayesian Information Criterion (EBIC) (Foygel and Drton 2010), and Rotation Information Criterion (Zhao et al. 2012).

In this paper, we focused on the StARS stability selection approach proposed by Liu, Roeder, and Wasserman (2010). The method uses k subsamples of data to estimate the associated graphs for a given range of \lambda_1 values. For each value, a global instability of the graph edges is computed. The optimal value of \lambda_1 is chosen so as to minimize the instability, as follows. Let \lambda^{(1)}_1, \dots, \lambda_1^{(K)} be a grid of sparsity regularization parameters, and S_1, \dots, S_N be N bootstrap samples obtained by sampling the rows of the data set \mathbf{X}. For each k\in\{1,\ldots,K\} and for each j\in\{1,\ldots, N\}, we denote by \mathcal{A}^{k,j}(\mathbf{X}) the adjacency matrix of the estimated graph obtained by applying the inference algorithm to S_n with regularization parameter \lambda_1^{(k)}. For each possible edge (s,t)\in\{1,\ldots,p\}^2, the probability of edge appearance is estimated empirically by \hat \theta_{st}^{(k)} = \frac{1}{N} \sum_{j=1}^N \mathcal{A}^{k,j}_{st}. Define \hat \xi_{st}(\Lambda) = 2 \hat \theta_{st} (\Lambda) \left ( 1 - \hat \theta_{st} (\Lambda) \right ) the empirical instability of edge (s,t) (that is, twice the variance of the Bernoulli indicator of edge (s,t)). The instability level associated to \lambda_1^{(k)} is given by

\hat D(\lambda_1^{(k)}) = \frac{\sum_{s<t} \hat \xi_{st}(\lambda_1^{(k)})}{p \choose 2}, StARS selects the optimal penalty parameter as follows

\hat \lambda = \max_k\left\{ \lambda_1^{(k)}: \hat D(\lambda_1^{(k)}) \le \beta, k\in\{1,\ldots,K\} \right \},

where \beta is the threshold chosen for the instability level.

4 Simulation experiments

In this section, we conduct a simulation study to evaluate the performance of the MGLasso method, both in terms of clustering and support recovery. Receiver Operating Characteristic (ROC) curves are used to evaluate the adequacy of the inferred graphs with the reality for the MGLasso and GLasso methods in the Erdös-Renyi, Scale-free, and Stochastic Block Models frameworks. The Adjusted Rand indices are used to compare the partitions obtained with MGLasso, hierarchical agglomerative clustering, and K-means clustering in a stochastic block model framework.

4.1 Synthetic data models

We consider three different synthetic network models: the Stochastic Block Model (SBM, Fienberg and Wasserman (1981)), the Erdös-Renyi model (Erdős, Rényi, et al. 1960) and the Scale-Free model (Newman, Strogatz, and Watts 2001). In each case, Gaussian data is generated by drawing n independent realizations of a multivariate Gaussian distribution \mathcal N(0, \mathbf{\Sigma}) where \mathbf{\Sigma} \in \mathbb{R}^{p \times p} and \mathbf{\Omega} = \mathbf{\Sigma} ^{-1}. The support of \mathbf{\Omega}, equivalent to the network adjacency matrix, is generated from the three different models. The difficulty level of the problem is controlled by varying the ratio \frac{n}{p} with p fixed at 40: \frac{n}{p}\in \{0.5,1,2\}.

4.1.1 Stochastic Block-Model

We construct a block-diagonal precision matrix \mathbf{\Omega} as follows. First, we generate the support of \mathbf{\Omega} as shown in Figure 1, denoted by \boldsymbol A\in\{0,1\}^{p\times p}. To do this, the variables are first partitioned into K = 5 hidden groups, noted C_1, \dots, C_K described by a latent random variable Z_i, such that Z_i = k if i = C_k. Z_i follows a multinomial distribution P(Z_i = k) = \pi_k, \quad \forall k \in \{1, \dots, K\}, where \pi = (\pi_1, \dots, \pi_k) is the vector of proportions of clusters whose sum is equal to one. The set of latent variables is noted \mathbf{Z} = \{ Z_1, \dots, Z_K\}. Conditionally to \mathbf{Z}, A_{ij} follows a Bernoulli distribution such that A_{ij}|Z_i = k, Z_j = l \sim \mathcal{B}(\alpha_{kl}), \quad \forall k,l \in \{1 \dots, K\}, where \alpha_{kl} is the probability of inter-cluster connectivity, with \alpha_{kl} = 0.01 if k\neq l and \alpha_{ll} = 0,75. For k\in\{1,\ldots, K\}, we define p_k = \sum_{i=1}^p \boldsymbol{1}_{\{Z_i = k\}}. The precision matrix \mathbf{\Omega} of the graph is then calculated as follows. We define \Omega_{ij} = 0 if Z_i\neq Z_j ; otherwise, we define \Omega_{ij} = A_{ij}\omega_{ij} where, for all i\in\{1,\ldots,p\} and for all j\in\{1,\ldots,p| Z_j = Z_i\}, \omega_{ij} is given by :

\begin{equation} \begin{aligned} &\omega_{ii} := \frac{1+\rho(p_{Z_i}-2)}{1+\rho(p_{Z_i}-2)-\rho^2(p_{Z_i}-1)};\\ &\omega_{ij} := \frac{-\rho}{1+\rho(p_{Z_i}-2)-\rho^2(p_{Z_i}-1)}. \end{aligned} \end{equation}

If \alpha_{ll} were to be equal to one, this construction of \mathbf{\Omega} would make it possible to control the level of correlation between the variables in each block to \rho. Introducing a more realistic scheme with \alpha_{ll}=0.75 allows only to have an approximate control.

Figure 1: Adjacency matrix of a stochastic block model with 5 blocks.

4.1.2 Erdös-Renyi Model

The Erdös-Renyi model is a special case of the stochastic block model where \alpha_{kl} = \alpha_{ll} = \alpha is constant. We set the density \alpha of the graph to 0.1; see Figure 2 for an example of the graph resulting from this model.

Figure 2: Adjacency matrix of an Erdös-Renyi model.

4.1.3 Scale-free Model

The Scale-free Model generates networks whose degree distributions follow a power law. The graph starts with an initial chain graph of 2 nodes. Then, new nodes are added to the graph one by one. Each new node is connected to an existing node with a probability proportional to the degree of the existing node. We set the number of edges in the graph to 40. An example of scale-free graph is shown in Figure 3.

Figure 3: Adjacency matrix of a Scale-free model.

4.2 Support recovery

We compare the network structure learning performance of our approach to that of GLasso in its neighborhood selection version using ROC curves. In both GLasso and MGLasso, the sparsity is controlled by a regularization parameter \lambda_1; however, MGLasso admits an additional regularization parameter, \lambda_2, which controls the strength of convex clustering. To compare the two methods, in each ROC curve, we vary the parameter \lambda_1 while the parameter \lambda_2 (for MGLasso) is kept constant. We computed ROC curves for 4 different penalty levels for the \lambda_2 parameter; since GLasso does not depend on \lambda_2, the GLasso ROC curves are replicated.

In a decision rule associated with a sparsity penalty level \lambda_1, we recall the definition of the two following functions. The sensitivity, also called the true positive rate or recall, is given by : \begin{align*} \lambda_1 &\mapsto \text{sensitivity}(\lambda_1) = \frac{TP(\lambda_1)}{TP(\lambda_1) + FN(\lambda_1)}. \end{align*} Specificity, also called true negative rate or selectivity, is defined as follows: \begin{align*} \lambda_1 &\mapsto \text{specificity}(\lambda_1) = \frac{TN(\lambda_1)}{TN(\lambda_1) + FP(\lambda_1)}. \end{align*} The ROC curve with the parameter \lambda_1 represents \text{sensitivity}(\lambda_1) as a function of 1 - \text{specificity}(\lambda_1) which is the false positive rate.

For each configuration (n, p fixed), we generate 50 replications and their associated ROC curves, which are then averaged. The average ROC curves for the three models are given in Figure 4, Figure 5 and Figure 6 by varying \frac{n}{p}\in \{0.5,1,2\}.

Figure 4: ROC curves for the Erdös-Renyi model comparing MGLasso and GLasso methods.

Figure 5: ROC curves for the Scale-free model comparing MGLasso and GLasso methods.

Figure 6: ROC curves for the Stochastic Block model comparing MGLasso and GLasso methods.

Based on these empirical results, we first observe that, in all the considered simulation models, MGLasso improves over GLasso in terms of support recovery in the high-dimensional setting where p<n. In addition, in the absence of a total variation penalty, i.e., \lambda_2 = 0, MGLasso performs no worse than GLasso in each of the 3 models. However, for \lambda_2>0, increasing penalty value does not seem to significantly improve the support recovery performances for the MGLasso, as we observe similar results for \lambda_2=3.3,6.6,10. Preliminary analyses show that, as \lambda_2 increases, the estimates of the regression vectors are shrunk towards 0. This shrinkage effect of group-fused penalty terms was also observed in (Chu et al. 2021).

4.3 Clustering

In order to obtain clustering performance, we compared the partitions estimated by MGLasso, Hierarchical Agglomerative Clustering (HAC) with Ward’s distance and K-means to the true partition in a stochastic block model framework. Euclidean distances between variables are used for HAC and K-means. The criterion used for the comparison is the adjusted Rand index. We studied the influence of the correlation level inside clusters on the clustering performances through two different parameters: \rho \in \{ 0.1, 0.3 \}; the vector of cluster proportions is fixed at \mathbf \pi = (1/5, \dots, 1/5). We then simulate 100 Gaussian data sets for each simulation configuration (\rho, n/p fixed).The optimal sparsity penalty for MGLasso is chosen by the Stability Approach to Regularization Selection (StARS) method (Liu, Roeder, and Wasserman 2010), and we vary the parameter \lambda_2.

Figure 7: Adjusted Rand Indices for the HAC, k-means and MGLasso methods. Performances are observed for 4 different number of clusters in a low correlation context

Figure 8: Adjusted Rand Indices for the HAC, k-means and MGLasso methods. Performances are observed for 4 different number of clusters in a high correlation context

The results shown in Figure 7 and Figure 8 demonstrate that, particularly at the lower to medium levels of the hierarchy (between 20 and 10 clusters), the hierarchical clustering structure uncovered by MGLasso is comparable to popular clustering methods used in practice. For the higher levels (5 clusters), the performances of MGLasso deteriorate. As expected, the three compared methods also deteriorate as the level of correlation inside clusters decreases.

5 Analysis of microbial associations data

We finally illustrate our new method of inferring the multiscale Gaussian graphical model, with an application to the analysis of microbial associations in the American Gut Project. The data used are count data that have been previously normalized by applying the log-centered ratio technique as used in (Kurtz 2015). After some filtering steps (Kurtz 2015) on the operational taxonomic units OTUs counts (removed if present in less than 37\% of the samples) and the samples (removed if sequencing depth below 2700), the top OTUs are grouped in a dataset composed of n_1 = 289 for 127 OTUs. As a preliminary analysis, we perform a hierarchical agglomerative clustering (HAC) on the OTUs, which allows us to identify four significant groups. The correlation matrix of the dataset is given in Figure 9; variables have been rearranged according to the HAC partition.

Figure 9: Empirical correlations in the gut data

The average correlations within blocks of variables belonging to the same cluster are given below. We observe relatively high levels of correlation in small blocks, similar to the simulation models used to evaluate the performance of clustering in the Simulation Experiments section.

Cluster Mean.correlation
1 0.013
2 0.815
3 0.555
4 0.566

We apply MGLasso to the normalised counts to infer a graph and a clustering structure. The graphs obtained by MGLasso for \lambda_2 = 0 and for \lambda_2 = 5 (corresponding respectively 127 and 80 clusters) are given below. In each case, the parameter \lambda_1 is chosen by stability selection (see Model Selection section).

Figure 10: Infered graphs using MGLasso for different values of total variation penalty

Figure 10: Infered graphs using MGLasso for different values of total variation penalty

The variables were reordered according to the clustering partition obtained from the distances between the regression vectors. Increasing \lambda_2 reduces the number of clusters and leads to a shrinking effect on the estimates. The adjacency matrix of the neighborhood selection equivalent to setting \lambda_2 to 0 is given in Figure 10 (left). In Figure 10 (right), the deduced partition is composed of 80 clusters. A confusion matrix comparing the edges deduced by MGLasso with \lambda_2 = 5 and neighborhood selection is given below. Adding a total variation parameter increases the merging effect, resulting in a larger number of edges in the graph.

Neighborhood selection
MGLasso (\lambda_2 = 5) non-edges edges
non-edges 15678 0
edges 288 163

6 Conclusion

We proposed a new technique that combines Gaussian Graphical Model inference and hierarchical clustering called MGLasso. The method proceeds via convex optimization and minimizes the neighborhood selection objective penalized by a hybrid regularization combining a sparsity-inducing norm and a convex clustering penalty. We developed a complete numerical scheme to apply MGLasso in practice, with an optimization algorithm based on CONESTA and a model selection procedure. Our simulations results over synthetic and real datasets showed that MGLasso can perform better than GLasso in network support recovery in the presence of groups of correlated variables, and we illustrated the method with the analysis of microbial associations data. The present work paves the way for future improvements: first, by incorporating prior knowledge through more flexible weighted regularization; second, by studying the theoretical properties of the method in terms of statistical guarantees for the MGLasso estimator.

Session information

R version 4.1.3 (2022-03-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Big Sur/Monterey 10.16

Matrix products: default
BLAS/LAPACK: /Users/runner/miniconda3/envs/computo/lib/libopenblasp-r0.3.21.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Matrix_1.4-1      huge_1.3.5        SpiecEasi_1.1.2   simone_1.0-4     
[5] blockmodels_1.1.5 ggplot2_3.3.6    

loaded via a namespace (and not attached):
 [1] shape_1.4.6       tidyselect_1.1.2  xfun_0.32         purrr_0.3.4      
 [5] splines_4.1.3     lattice_0.20-45   colorspace_2.0-3  vctrs_0.4.1      
 [9] generics_0.1.3    htmltools_0.5.3   stats4_4.1.3      yaml_2.3.5       
[13] utf8_1.2.2        survival_3.4-0    rlang_1.0.4       pillar_1.8.1     
[17] glue_1.6.2        withr_2.5.0       DBI_1.1.3         foreach_1.5.2    
[21] lifecycle_1.0.1   stringr_1.4.1     munsell_0.5.0     gtable_0.3.0     
[25] mvtnorm_1.1-3     htmlwidgets_1.5.4 codetools_0.2-18  VGAM_1.1-7       
[29] evaluate_0.16     labeling_0.4.2    knitr_1.40        fastmap_1.1.0    
[33] parallel_4.1.3    fansi_1.0.3       highr_0.9         Rcpp_1.0.9       
[37] scales_1.2.1      captioner_2.2.3   jsonlite_1.8.0    farver_2.1.1     
[41] pulsar_0.3.8      digest_0.6.29     stringi_1.7.8     dplyr_1.0.9      
[45] grid_4.1.3        cli_3.3.0         tools_4.1.3       magrittr_2.0.3   
[49] glmnet_4.1-4      tibble_3.1.8      pkgconfig_2.0.3   MASS_7.3-58.1    
[53] assertthat_0.2.1  rmarkdown_2.16    iterators_1.0.14  R6_2.5.1         
[57] igraph_1.3.4      compiler_4.1.3   

References

Ambroise, Christophe, Julien Chiquet, and Catherine Matias. 2009. Inferring sparse gaussian graphical models with latent structure.” Electronic Journal of Statistics 3 (0): 205–38. https://doi.org/10.1214/08-EJS314.
Banerjee, Onureena, Laurent El Ghaoui, and Alexandre d’Aspremont. 2008. “Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data” 9 (June): 485–516.
Beck, Amir, and Marc Teboulle. 2009. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.” SIAM J. Imaging Sciences 2 (January): 183–202. https://doi.org/10.1137/080716542.
Bien, Jacob, and Robert J Tibshirani. 2011. “Sparse Estimation of a Covariance Matrix.” Biometrika 98 (4): 807–20.
Bühlmann, Peter, Philipp Rütimann, Sara Van De Geer, and Cun-Hui Zhang. 2012. Correlated variables in regression: clustering and sparse estimation.”
Cai, Tony, Weidong Liu, and Xi Luo. 2011. “A Constrained L1 Minimization Approach to Sparse Precision Matrix Estimation.” Journal of the American Statistical Association 106 (494): 594–607. https://doi.org/10.1198/jasa.2011.tm10155.
Cheng, Lulu, Liang Shan, and Inyoung Kim. 2017. Multilevel Gaussian graphical model for multilevel networks.” Journal of Statistical Planning and Inference 190 (November): 1–14. https://doi.org/10.1016/j.jspi.2017.05.003.
Chu, Shuyu, Huijing Jiang, Zhengliang Xue, and Xinwei Deng. 2021. “Adaptive Convex Clustering of Generalized Linear Models with Application in Purchase Likelihood Prediction.” Technometrics 63 (2): 171–83.
Dempster, A. P. 1972. Covariance Selection.” Biometrics 28 (1): 157. https://doi.org/10.2307/2528966.
Devijver, Emilie, and Mélina Gallopin. 2018. Block-Diagonal Covariance Selection for High-Dimensional Gaussian Graphical Models.” Journal of the American Statistical Association 113 (521): 306–14. https://doi.org/10.1080/01621459.2016.1247002.
Erdős, Paul, Alfréd Rényi, et al. 1960. “On the Evolution of Random Graphs.” Publ. Math. Inst. Hung. Acad. Sci 5 (1): 17–60.
Fan, Jianqing, Yuan Liao, and Han Liu. 2016. “An Overview of the Estimation of Large Covariance and Precision Matrices.” The Econometrics Journal 19 (1): C1–32. https://doi.org/https://doi.org/10.1111/ectj.12061.
Fienberg, Stephen E, and Stanley S Wasserman. 1981. “Categorical Data Analysis of Single Sociometric Relations.” Sociological Methodology 12: 156–92.
Foygel, Rina, and Mathias Drton. 2010. “Extended Bayesian Information Criteria for Gaussian Graphical Models.” arXiv Preprint arXiv:1011.6640.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2007. Sparse inverse covariance estimation with the graphical lasso.”
Ganguly, Apratim, and Wolfgang Polonik. 2014. “Local Neighborhood Fusion in Locally Constant Gaussian Graphical Models.” https://arxiv.org/abs/1410.8766.
Giraud, Christophe, Sylvie Huet, and Nicolas Verzelen. 2012. “Graph Selection with GGMselect.” Statistical Applications in Genetics and Molecular Biology 11 (3).
Grechkin, Maxim, Maryam Fazel, Daniela Witten, and Su-In Lee. 2015. “Pathway Graphical Lasso.” In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2617–23. AAAI’15. Austin, Texas: AAAI Press.
Hadj-Selem, Fouad, Tommy Lofstedt, Elvis Dohmatob, Vincent Frouin, Mathieu Dubois, Vincent Guillemot, and Edouard Duchesnay. 2018. Continuation of Nesterov’s Smoothing for Regression with Structured Sparsity in High-Dimensional Neuroimaging.” IEEE Transactions on Medical Imaging 2018. https://doi.org/10.1109/TMI.2018.2829802.
Hocking, T., Jean-Philippe Vert, F. Bach, and Armand Joulin. 2011. “Clusterpath: An Algorithm for Clustering Using Convex Fusion Penalties.” In ICML.
Honorio, Jean, Dimitris Samaras, Nikos Paragios, Rita Goldstein, and Luis E Ortiz. 2009. “Sparse and Locally Constant Gaussian Graphical Models.” Advances in Neural Information Processing Systems 22: 745–53.
Hsieh, Cho-Jui, Mátyás A. Sustik, Inderjit S. Dhillon, and Pradeep Ravikumar. 2014. “QUIC: Quadratic Approximation for Sparse Inverse Covariance Estimation.” Journal of Machine Learning Research 15 (83): 2911–47. http://jmlr.org/papers/v15/hsieh14a.html.
Koller, Daphne, and Nir Friedman. 2009. Probabilistic Graphical Models: Principles.” Italica 51 (3): 327. https://doi.org/10.2307/478142.
Kurtz, Christian L. AND Miraldi, Zachary D. AND Müller. 2015. “Sparse and Compositionally Robust Inference of Microbial Ecological Networks.” PLOS Computational Biology 11 (May): 1–25. https://doi.org/10.1371/journal.pcbi.1004226.
Lauritzen, Steffen L. 1996. Graphical models. Clarendon Press. https://global.oup.com/academic/product/graphical-models-9780198522195?cc=fr&lang=en&.
Lindsten, F., H. Ohlsson, and L. Ljung. 2011. “Clustering Using Sum-of-Norms Regularization: With Application to Particle Filter Output Computation.” In 2011 IEEE Statistical Signal Processing Workshop (SSP), 201–4. https://doi.org/10.1109/SSP.2011.5967659.
Liu, Han, Kathryn Roeder, and Larry Wasserman. 2010. Stability approach to regularization selection (StARS) for high dimensional graphical models.” Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010, NIPS 2010, 1–14.
Mazumder, Rahul, and Trevor Hastie. 2012. The graphical lasso: New insights and alternatives.” Electronic Journal of Statistics 6 (none): 2125–49. https://doi.org/10.1214/12-EJS740.
Meinshausen, Nicolai, and Peter Bühlmann. 2006. High-dimensional graphs and variable selection with the Lasso.” Annals of Statistics 34 (3): 1436–62. https://doi.org/10.1214/009053606000000281.
———. 2010. “Stability Selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 (4): 417–73.
Newman, Mark EJ, Steven H Strogatz, and Duncan J Watts. 2001. “Random Graphs with Arbitrary Degree Distributions and Their Applications.” Physical Review E 64 (2): 026118.
Park, Mee Young, Trevor Hastie, and Robert Tibshirani. 2006. Averaged gene expressions for regression.” Biostatistics 8 (2): 212–27. https://doi.org/10.1093/biostatistics/kxl002.
Peng, Jie, Pei Wang, Nengfeng Zhou, and Ji Zhu. 2009. “Partial Correlation Estimation by Joint Sparse Regression Models.” Journal of the American Statistical Association 104 (486): 735–46. https://doi.org/10.1198/jasa.2009.0126.
Rocha, Guilherme V., Peng Zhao, and Bin Yu. 2008. “A Path Following Algorithm for Sparse Pseudo-Likelihood Inverse Covariance Estimation (SPLICE).”
Rothman, Adam J., Peter J. Bickel, Elizaveta Levina, and Ji Zhu. 2008. Sparse permutation invariant covariance estimation.” Electronic Journal of Statistics 2 (none): 494–515. https://doi.org/10.1214/08-EJS176.
Tan, Kean Ming, Daniela Witten, and Ali Shojaie. 2013. The Cluster Graphical Lasso for improved estimation of Gaussian graphical models,” July. http://arxiv.org/abs/1307.5339.
Tibshirani, R. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society (Series B) 58: 267–88.
Yao, Tianyi, and Genevera I. Allen. 2019. “Clustered Gaussian Graphical Model via Symmetric Convex Clustering.” In 2019 IEEE Data Science Workshop (DSW), 76–82. https://doi.org/10.1109/DSW.2019.8755774.
Yuan, Ming. 2010. “High Dimensional Inverse Covariance Matrix Estimation via Linear Programming.” Journal of Machine Learning Research 11 (79): 2261–86. http://jmlr.org/papers/v11/yuan10b.html.
Yuan, Ming, and Yi Lin. 2007. Model selection and estimation in the Gaussian graphical model.” Biometrika 94 (1): 19–35. https://doi.org/10.1093/biomet/asm018.
Zhao, Tuo, Han Liu, Kathryn Roeder, John Lafferty, and Larry Wasserman. 2012. “The Huge Package for High-Dimensional Undirected Graph Estimation in r.” The Journal of Machine Learning Research 13 (1): 1059–62.

Reuse

Citation

BibTeX citation:
@article{sanou,
  author = {Edmond Sanou and Christophe Ambroise and Geneviève Robin},
  editor = {},
  title = {Inference of {Multiscale} {Gaussian} {Graphical} {Model}},
  journal = {Computo},
  date = {},
  url = {https://github.com/desanou/multiscale_glasso},
  doi = {xxxx},
  langid = {en},
  abstract = {Gaussian Graphical Models (GGMs) are widely used for
    exploratory data analysis in various fields such as genomics,
    ecology, psychometry. In a high-dimensional setting, when the number
    of variables exceeds the number of observations by several orders of
    magnitude, the estimation of GGM is a difficult and unstable
    optimization problem. Clustering of variables or variable selection
    is often performed prior to GGM estimation. We propose a new method
    allowing to simultaneously infer a hierarchical clustering structure
    and the graphs describing the structure of independence at each
    level of the hierarchy. This method is based on solving a convex
    optimization problem combining a graphical lasso penalty with a
    fused type lasso penalty. Results on real and synthetic data are
    presented.}
}
For attribution, please cite this work as:
Edmond Sanou, Christophe Ambroise, and Geneviève Robin. n.d. “Inference of Multiscale Gaussian Graphical Model.” Computo. https://doi.org/xxxx.